1 

Distributed Learning of Gaussian Graphical 
Models via Marginal Likelihoods 

Zhaoshi Meng\ Dennis Wei\ Ami Wiesel^, and Alfred Hero III^ 

^University of Michigan, Ann Arbor, MI, USA 
^The Hebrew University of Jerusalem, Jerusalem, Israel 

Abstract 

We consider distributed estimation of the inverse covariance matrix, also called the concentration 
matrix, in Gaussian graphical models. Traditional centralized estimation often requires iterative and 
expensive global inference and is therefore difficult in large distributed networks. In this paper, we 
propose a general framework for distributed estimation based on a maximum marginal UkeUhood (MML) 
approach. Each node independently computes a local estimate by maximizing a marginal likelihood 
defined with respect to data collected from its local neighborhood. Due to the non-convexity of the MML 
problem, we derive and consider solving a convex relaxation. The local estimates are then combined into 
a global estimate without the need for iterative message-passing between neighborhoods. We prove that 
this relaxed MML estimator is asymptotically consistent. Through numerical experiments on several 
synthetic and real- world data sets, we demonstrate that the two-hop version of the proposed estimator 
is significantly better than the one-hop version, and nearly closes the gap to the centralized maximum 
likelihood estimator in many situations. 

I. Introduction 

Graphical models provide a principled framework for compactly representing dependencies among 
random variables |[l|, |[2|. The ability to perform distributed and efficient inference makes them especially 
well-suited to large networks, e.g., sensor networks, social networks, gene regulatory networks, and smart 
grids ||3|-|[5|. Estimating graphical model parameters is therefore an important first step in enabling these 
applications. 

This research was supported in part by ARO grant W9I INF- 1 1-1-0391. 
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For Gaussian graphical models (GGM), parameter estimation essentially reduces to covariance estima- 
tion and is further simplified by the graph structure: the pattern of edges in the graph corresponds to the 
sparsity pattern of the inverse covariance matrix, also known as the concentration or precision matrix. 
When the sparsity pattern is known a priori, the traditional technique for inverse covariance estimation 
is maximum likelihood (ML), resulting in a convex optimization that can be solved either by generic 
solvers or specialized algorithms ||6|-||8|. However, solving the exact ML estimation problem requires 
either expensive data communication (e.g., to a fusion center) and centralized computation/storage, or 
in a distributed setting, iterative message -passing in loopy graphs to compute derivatives |j9|. Neither 
of those strategies are feasible in large, generally structured and distributed networks, such as a sensor 
network where the corresponding graph could be very loopy and the computational resources are highly 
decentralized. 

Given the limitations of centralized solutions, researchers have begun proposing distributed methods 
for graphical model learning, where agents in the network estimate local parameters by processing local 
data with limited communication between agents |[3|, Q. In addition to decentralizing computation and 
communication across the network, distributed solutions are naturally robust against localized attacks 
or failures, and are also more protective of data privacy. These advantages make distributed algorithms 
particularly appealing for network applications. 

This paper proposes a general framework for distributed estimation based on marginal likelihoods. Each 
node collects data within its neighborhood and independently forms a local estimate by maximizing a 
marginal likelihood. We show that this marginal likelihood approach is sufficient in principle to determine 
the global parameter. Notably, the formation of our global estimate from local estimates does not require 
iterative consensus or message-passing as in ||3|, Q. Due to the non-convexity of maximum marginal 
likelihood (MML) estimation, we derive and consider solving a convex relaxation of the problem. We 
prove that the relaxed MML estimator is asymptotically consistent. We then focus on two specific cases in 
which the local neighborhoods correspond to either one or two communication hops through the network. 
Extensive experiments show that the two-hop version of the proposed estimator is significantly better than 
the one-hop version and nearly closes the gap to the centralized ML estimator in many situations. 

The outline of the paper is as follows. In Section [IT| we give a brief review of graphical models 



and centralized ML estimation. In Section III we propose a general approach to distributed estimation 
based on marginal likelihoods, show asymptotic consistency, and discuss two important specific cases. 
Numerical experiments are presented in Section IV and the paper concludes in Section [V] 
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A. Related work 

Recent works by |[3|, Q have discussed a similar but less general framework for distributed estimation 
in which several pseudo-likelihood-based estimators for GGMs and general exponential families are 



proposed. In Section III-D we show that the estimators in [4] can be seen as a special case of the 
proposed relaxed MML estimator, and a significant performance improvement can be obtained through 
a reasonable increase in the computation and communication cost. 

Our distributed estimator is formed through solving multiple relaxed MML estimation problems by 
local agents. This MML formulation contrasts with the pseudolikelihood and composite-likelihood for- 



mulations 1 10 1 in which a global loss function is optimized. The structure of each local MML problem 



is related to the latent graphical model considered in |11|. However, the assumption in |11| is that the 
number of latent variables is relatively small, thus contributing a low-rank term, whereas in our case, all 
variables outside the local neighborhood can be seen as latent variables. 

Another line of work |[8|, |12|-|14| focuses on the problem of covariance selection, where the graph 
topology is not known a priori and must be estimated in addition to the parameters. This appears to be a 
harder problem than ours since we assume the structure is known, but some insights regarding distributed 
algorithms in particular can perhaps be shared. 

Notation. Boldface upper case letters denote matrices and boldface lower case letters denote column 
vectors. Sets of single indices are denoted by calligraphic upper case letters. The cardinality of a set 
A is denoted by |^| and the difference of two sets is denoted as A\B. Following common notation, 
^M,Af represents a submatrix of A with rows indexed by Ai and columns indexed by M. We also make 
reference to irregular sets of index pairs such as the edge set E of a graph, for which we use standard 
upper case letters. A^; then refers to the vector of entries of A indexed by E. The standard inner product 
between two symmetric matrices is denoted as (A,B), i.e., 

(A,B) = trace(AB) = ^ AjjEij. 
IL Background 

A. Graphical models 

We refer the reader to |[T| for a detailed treatment of graphical models. We consider a p-dimensional 
random vector x following a graphical model with respect to an undirected graph Q = {V,E), where 
y = {1, ... ,p} is a set of nodes corresponding to elements of x and -E is a set of edges connecting 
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nodes. The vector x satisfies tlie Markov property witli respect to G if for any pair of nonadjacent nodes 
in Q, the corresponding pair of variables in x are conditionally independent given the remaining variables. 

For Gaussian graphical models (GGM), the vector x follows a multivariate Gaussian distribution. We 
assume without loss of generality that x has zero mean. Then the probability density function can be 
written in canonical form in terms of the concentration matrix J as follows: 

p(x; J) = (27r)-P/2(^g^ j)i/2g^p l^-ix^Jx^ . (1) 

The Markov property manifests itself in a simple way through the sparsity pattern of J: 

Jij = for all ^ E. (2) 

This property leads to efficient inference algorithms for GGMs, and as we will show, it can also be 
exploited for distributed parameter learning. 

B. Centralized estimation in GGMs 

Maximum likelihood estimation (MLE) is a classical approach to estimating model parameters from 
data. For Gaussian graphical models, the problem reduces to estimating the non-zero elements of the 
concentration matrix J. These elements are indexed by E, the union of the edges and pairs corresponding 
to diagonal elements, 

E:=EU{{i,i)Y,^^. (3) 

When all the data are accessible, the centralized global maximum likelihood (GML) estimation problem 
can be formulated as |[lj 

jGML ^ argmin (S, J) - logdet J 
J 

s.t. Jj,fe = y{j,k)iE, (4) 
J ^ 0. 

where 



S = i^x(t)x(i)^ 



t=i 

is the sample covariance matrix and x(l), . . . ,x(T) are i.i.d. samples of x. Since the GML problem 
([4]) is convex in J^, efficient gradient-based algorithms can be applied, many of which have specialized 
implementations on graphs, e.g. iterative proportional fitting (IPF) ||2j, chordally-embedded Newton's 
method ||7|, and an iterative regression method introduced in |15|. However, as mentioned before, the 



main drawback of these methods is the computational and communication complexity when implemented 
in networks. 
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(a) 2D lattice and two-hop neighborhood (b) A general graph and two-hop neigh- 
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Fig. 1. Illustration of defined sets and local relaxation. In (a) and (b) we show two different graphs, in which the two-hop 
neighborhood N for node i is indicated with dashed contours. The buffer set variables xg and the protected set variables x-p 
(excluding node i itself) are colored blue and red, respectively. For the graph in (b), we illustrate the one-hop and two-hop local 
relaxation problems in (c). The dashed red lines in (c) denote the fill-in edges due to relaxation. 



We now consider a general approach to distributed parameter estimation motivated by network ap- 
plications. Each random variable Xj is associated with a node in the network. The topology of the 
graph Q, which encodes statistical dependences, is assumed to coincide with the topology of internode 
communication. Each node collects all the data samples from within a neighborhood and computes a 
local parameter estimate. A global estimate of J is then formed by combining these local estimates. The 
key components in this framework are the choice of neighborhood, the local estimation method, and the 
method for combining estimates. 

A. Marginal Likelihood Maximization 

In this paper, we propose to estimate local parameters by maximizing marginal likelihood functions. 
For each node i, define the index set for its immediate neighbors as 



and consider a neighborhood indexed by a set Mi containing at least the node i itself and its immediate 
neighbors X^. Let K denote the concentration matrix corresponding to the marginal distribution over the 
variables {xj,j G TVj} in the neighborhood, and let 



III. Distributed Estimation in GGMs 



^i-={3 I {hj)(^E] 



(5) 




t=i 
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be the local sample covariance matrix. Then the maximum marginal likelihood (MML) estimation problem 
in neighborhood A/i can be formulated as a joint optimization of K and J: 



s.t. K 



K,J 

-1 



(6) 
J ^ 0, 

where the first constraint represents the marginalization relationship between K and J and the second 
line of constraints reflects the sparsity of the global parameter. 

A global estimate of J can be formed by combining optimal solutions to problem ([6]) for each node 



i. We consider a simple concatenation approach whose justification will be given in Section III-B below. 
Denote by Li the set of index pairs corresponding to the non-zero entries in the ith row of J, i.e., 

Li:={ij,k) GE \ j = i}. (7) 

Note that the self-edge pair is included in Lj given the definition of in ([3]). We refer to the 
parameters indexed by Li as the row parameters for node i. The global estimate is constructed by 
extracting the row parameters from each local estimate and concatenating them: 

jMML ^ j^^^MML^ fori = l,...,p. (8) 

Note that this construction of the global estimate J'^'^l ^^^^ ^^so the other surrogate estimator proposed 
below) does not guarantee symmetry, since it is not a major concern of this work. However, as will shown 



in Section III-C the symmetry is naturally recovered in the asymptotic limit; also it can be imposed in 



the non-asymptotic case through a simple strategy (see Section III-Fl. 

The difficulty with the MML approach is that problem Q is in general a non-convex optimization. 
The non-convexity arises from the coupling of the nonlinear marginalization constraint linking K to J 
and the sparsity constraints on J. In the next subsection, we derive a convex relaxation of the MML 
estimation problem as a surrogate. 

B. Convex Relaxation of Marginal Likelihood Maximization 

In the remainder of the section, we suppress the node index i whenever possible for clarity. To derive 
a convex relaxation of the MML estimation problem ([6]l, we apply the Schur complement identity to the 
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marginalization constraint in Q, yielding 



(9) 



where AA*^ is the complementary set to M. Define the buffer set B <Z M as the set of all variables in J\f 
that have immediate neighbors in the complement J\f^ , 

B ■= {j \ j £M and Ij n AA^ / 0}. (10) 

The difference set between Af and B is referred to as the protected set V. The buffer and protected sets 



are illustrated in Figure 1(a) and 1(b) Due to the Markov property, we have J-pA^c = 0. Decomposing 



M into B and V then reveals the sparsity pattern of K from (|9]): 





[J. 



0) Ja^c R 





Jb,A^c [Ja^c^a^c] ^Jj^c^q 



and hence 



K 



I -1 



(11) 

(12) 



An important observation from ( [TT] ) is that the sparsity pattern of Ja^^a^ is entirely preserved in the 
rows and columns indexed by the protected set V. In particular, because node i itself is always protected, 
the marginal and global concentration matrices share the same values of the row parameters indexed by 
L, which motivates our strategy for fusing local estimates in ([8]l. On the other hand, the sparsity pattern 
in the "buffer submatrix" Kg g is in general modified due to the fill-in term, i.e., the second term in 



(121 



Based on these observations, we now specify a relaxed set of constraints on the marginal concentration 



matrix K. First denote the set of all local edges that are not affected by the fill-in term in ( 12l as 



^Prot .-En {{V xV}u{V X B}U{B X V}} 



(13) 



where the superscript stands for "protected". We then add to E^^°^ all index pairs B x B that could 
potentially be affected by fill-in in ( [T2] ), resulting in a relaxed edge set R (see Figure 1(c) for illustrations): 

i? = ^P™tu{6x B}. (14) 
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In light of ( [TT] ) and ( [T2| ), any feasible marginal concentration matrix K for the MML estimation problem 
([6]) is guaranteed to be supported only on the set R. Therefore the following constraints specify a set 
that contains the feasible set of the MML problem Q: 

K„fc = V(j,fc)^i?, 

(15) 

K >- 0. 



Using the constraints in ( [15) , we formulate the following relaxation of the original MML estimation 
problem ([6]) at each node i: 



K 



s.t. K,,fc = y{j,k)^R, (16) 
K ^ 0. 

The above relaxed MML problem is a convex optimization with respect to K/j and has the same form 
as the global MLE problem (|4]) but with much lower dimensions in general. 

Since the original MML estimation problem Q is difficult to solve, we propose to solve the relaxed 
MML estimation problem ([T6]l as a surrogate to estimate the local parameters. Then a global estimate of 
the concentration matrix can be similarly obtained by extracting and concatenating row parameters as in 
®: 

jRelax ^ j^^^Relax^ for i = 1 , . . . , p. (17) 

C. Asymptotic Consistency 

In this section, we suppress the index i for matrices S and K'^'^'^" for clarity. We prove that the relaxed 
MML estimator defined in (16^ and (Tf) is asymptotically consistent. To do so, we first make use of 



the continuity of the mapping from sample covariance S to local estimate K'^'^'^" in ( [T6] ) to establish the 
following lemma (proof in the Appendix). 

Lemma 1. The mapping from the sample covariance S to the concentration matrix K^'^'"^ through solving 
the relaxed MML problem ( |16| ) is a continuous mapping provided that S is positive definite. 

Theorem 1. The relaxed MML estimator J^*"'"^ is asymptotically consistent. 

Proof of Theorem 1: In the asymptotic limit, the local sample covariance matrix S converges to 
the corresponding sub-matrix of the true covariance matrix, with probability one and is therefore 



positive definite. Let K* = j^) ^ be the corresponding true marginal concentration matrix. We first 



DRAFT 



9 



show that K* is the optimal solution to the relaxed MML estimation problem ( [T6| ) when S = j^. To 
see this, note that the true global concentration matrix J* conforms to the sparsity pattern specified by 
E, and therefore the marginal concentration matrix K* is feasible for ([6]). Furthermore, from relations 
([TT])-([T2l) and our construction of the edge set R, K* also satisfies the sparsity constraints of the relaxed 
problem ([T6]l. It is straightforward to show that K* satisfies the optimality conditions when S = 
Then by strict convexity, K* is the unique optimum of the relaxed MML problem when S = 

It now follows from Lemma [T] and the continuous mapping theorem 1 16 1 that the relaxed MML 
estimates K^'^'^^ converge asymptotically to K* with probability one. Lastly, due to the absence of fill-in 
in ( [TT] ), the construction of the global estimate jJ^'^'^" in ( [TT] ) also yields a consistent estimate of J*. ■ 

It is interesting to note that the true marginal concentration matrix K* is also the optimal solution 
to the MML estimation problem ([6]) when S = SJ^^. This follows because the feasible region for the 
unrelaxed MML problem (with respect to K) is contained in the feasible region for the relaxed problem 
(16 1. However, convergence of the MML estimates K'^'^^ to K* is less clear because the feasible region 
in the unrelaxed problem is not known to be convex. 

Theorem [T] implies that an asymptotically consistent estimate of the global concentration matrix can 
be constructed by solving p local and convex estimation problems without any message-passing between 
neighborhoods. The highly localized nature of the proposed approach stands in sharp contrast to the 
centralized MLE, which requires either iterative global message-passing in triangulated graphs with po- 



tentially large tree-width, or tree-like approximations to avoid non-convergence in loopy graphs 1 17|, 1 18 1. 
Even in the case of chordal or decomposable GGMs, previous methods for estimating the concentration 
matrix require non-trivial message-passing between cliques, with cost that is quadratic in the dimensions 
of the separators, see e.g., [19]. 

While Theorem [T] claims asymptotic consistency for all choices of local neighborhoods containing 
different choices will yield estimators with different properties in the finite sample case, and each 
chosen local neighborhood might result in different consistency rate. i.e. different convergence rates. 
In principle, larger neighborhoods would allow each node to access more data and hence increase its 
information for estimating its local parameters. In the extreme case, if each node has access to all the data 
in the network, the local estimate is equal to the global MLE. On the other hand, larger neighborhoods 
require more parameters to be estimated, thus is tending to result in slower convergence rate towards the 
true parameters. Also the convex relaxation we propose removes the sparsity constraints on the buffer 
sub-matrix of the local concentration matrix. This relaxation would be expected to propagate and influence 
the parameter estimates in the protected edge set E^^°^, and in particular the row parameters that we 
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extract. Larger local neighborhoods would result in larger protected edge sets, which might be expected 
to decrease the effect of the relaxation on the row parameters of interest. 

In sum, the optimal choice of local neighborhoods is unclear due to the above trade-off. In the next 
subsections, we focus on two cases of the relaxed MML estimator with local neighborhood diameters 
corresponding to one and two communication hops through the network. The non- asymptotic behavior 



of the two estimators is studied through numerical experiments in Section IV 



D. Case I: One-hop Estimator 

We first consider a one-hop neighborhood consisting of node i and its immediate neighbors Xj, i.e.. 
Mi = {i,Ti}, Bi = Ti (generically in the worst case), and Vi = {%]. The fill-in term in ([12]) affects the 
submatrix Kj. j., leaving only the first row and column untouched. In this case, since i is by definition 
connected to all elements in Ti, the relaxed edge set Ri defined in ( [T?) ) includes all possible pairs (see 



Figure 1(c) for an illustration): 



The solution to the relaxed MML problem ( [T6| ), where no sparsity is imposed, is simply the inverse 
of the local sample covariance (assuming enough samples for invertibility), 

K^'ii^°P= fsM,^)"'- (18) 



The global estimate is obtained by combining the local one-hop estimates as in ([17 1. 

In the one -hop case, the proposed relaxed MML estimator reduces to the LOC estimator proposed 
in Q. As shown in Q, this estimator is also equivalent to the pseudolikelihood estimator |10| without 



symmetry constraint, and the graphical LASSO estimator |i8J when the graph is known. 

E. Case II: Two-hop Estimator 

We now consider the two-hop version of the estimator, where Mi now includes nodes that are reachable 
from node i within two routing hops. In this setting, the protected set is given by Vi = {i,Ti} (again 
generically) and the buffer set Bi = Mi\Vi consists of all nodes that are exactly two hops away from the 
ith node. Hence Bi can be thought of as the set of second-hop nodes. In the two-hop case, the protected 
edge set E^^°^ includes not only edges between node i and its immediate first-hop neighbors, but also 



edges between first-hop neighbors and between first- and second-hop neighbors (see Figure 1(c) for an 
illustration). 
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Unlike in the one-hop case, the two-hop problem does not admit a closed-form solution in general. A 



global estimate is obtained as before by combining row parameter estimates < \VT} . 

As the problem dimension p increases, if the numbers of non-zero variables in the relaxed local 
problems, \Ri\, are bounded by a constant or a lower-order term compared to \E\, the number of non- 
zero variables in the global problem, then the cost of solving the two-hop relaxed MML problem is 
only marginally higher than the cost of the one-hop problem, while both grow more slowly or not at all 



compared to the cost of the global MLE. In this large-scale setting, as will be seen in Section IV the 
two-hop estimator achieves significantly improved performance relative to the one-hop estimator with a 
modest increase in cost. 

F. Local Averaging 

In practice with finite samples, the local estimates from the relaxed MML problems are not perfectly 
consistent with each other. For example, Jf^''"', which comes from node i's local estimate, may not agree 
with J^f'^", which comes from node j's local estimate. Therefore the resulting global estimate ji^'^'^" in 



( [17] ) is not guaranteed to be symmetric. 

A common way of addressing these discrepancies is to use iterative consensus methods as in Q. 
In this work however, we find that a single round of naive local averaging along edges is sufficient to 
yield a good approximation to the global MLE. Specifically, the local average is given by 

an inexpensive operation in terms of computation and communication. In the one-hop case, the resulting 
symmetric estimator coincides with the AVE estimator proposed in Q. 

IV. Experiments 

In this section, we evaluate and compare the two-hop version of the relaxed MML estimator with local 
averaging (denoted as RelaxedMML 2 hops) with the following estimators: 

• The global ML estimator, denoted as MLE in the legend; 

• The LOCAL and AVE estimators from Q, denoted as LOG and AVE. They coincide with the 
asymmetric and symmetric versions respectively of the one-hop relaxed MML estimator; 

• The weighted maximum pseudo-likelihood estimator using Alternating Direction Method of Multi- 



pliers (ADMM) consensus, proposed in Q and 

2 



LOG 



^1,1 



, which are found to be successful in ||4| 



|3| and denoted as P ML -ADMM. We use the weights 
4I. 



We evaluate these estimators on both synthetic data and a real-world sensor network data set. 
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(a) Normalized MSE for K-NN graphs (p = 500, K ^ 4) (b) Normalized MSE for lattice graphs (p = 20 x 20 = 400, 

II = 0.5, = 0.2) 




Number of samples 



100 110 120 130 140 150 160 170 180 190 200 

Number of samples 



(c) Normalized MSE for small-world graphs (p = 100, K = (d) Normalized MSE for IntelLab Sensor Network Data set f221 
20, /3 = 0.5) ip = 50) 



Fig. 2. Normalized MSE in the concentration matrix estimates for different networks. The legend in Figure [2(d)| applies to all 
plots. The proposed 2-hop relaxed maximum marginal likelihood (MML) estimator clearly improves upon existing distributed 
estimators and nearly closes the gap to the centralized maximum likelihood estimator. 
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A. Synthetic data 

In our experiments, we consider three classes of graphs that are motivated by real-world applications. 
For each class we follow similar experiment settings as in [4]. Specifically, we randomly generate 20 
topologies and concentration matrices J, and for each J, we perform 10 experiments in which random 
samples are drawn from the distribution and the concentration matrix is estimated from the samples. The 
normalized mean squared errors (MSB), defined as 

~' If 



IJ- JP 



\-*\\f 



and averaged over all 200 experiments, are reported in Figure |2] An illustration of the graph topology is 
shown in the top-right comer of each plot. In Figure 2(a) and |2(b)[ the nodes are positioned according to 



the physical layout of the network, while in Figure 2(c) we show a common rendering for a small-world 
graph. The classes of graphs we consider are: 



K-NN graphs (Figure 2(a) i: A K-nearest neighbor graph is a straightforward model for real-world 
networks whose measurements have correlations that depend on pairwise Euclidean distances, e.g., 
sensor networks. For these experiments, we randomly generate p = 500 nodes uniformly over the 
unit square. Each node is then connected to its i^-nearest neighbors, where K = A. The concentration 
matrix is initialized as Jij = ±exp(— 0.5-(ijj) with random sign, where dij is the distance between 
the ith and j'th nodes. Finally we add a small value to the diagonal to ensure positive definiteness. 
Lattice graphs (Figure 2(b)[ ): A lattice graph is appropriate for networks with regular spatial 
correlations. We generate a square lattice graph with p = 20 x 20 = 400 nodes and edge weights 
generated as Jj j = nim{w, 1}, where w is a normally distributed random variable with mean 0.5 
and variance 0.2. A small value is added to the diagonal to ensure positive definiteness. 



• Small-world graphs (Figure 2(c) I: A small-world graph is a useful model for social networks, where 
most nodes have few immediate neighbors but can be reached from any other node through a small 
number of hops. We use the Watts-Strogatz model |20] to generate random small-world networks, 
with p = 100, K{mean degree) = 20, and parameter /3 = 0.5. Under this particular setting, a large 
fraction of nodes have large second-hop neighborhoods with dimension close to p. In general we 
expect the second-hop neighborhood to scale linearly with respect to p. We choose the edge weights 
to be uniformly distributed and also add a small diagonal loading. 
From the experiment results in Figure |2j we can make the following comments. For the graphs that 
have relatively small two-hop neighborhoods, such as the K-NN graphs and the lattice grids, the proposed 
two-hop relaxed MML estimator almost coincides with the global MLE and outperforms other distributed 
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estimators significantly. On the other hand, for graphs such as the small-world networks, the dimensions 
of the two-hop neighborhoods grow as fast as p. In this case, a noticeable gap emerges between the 
global MLE and the two-hop relaxed MML estimator. However, these graphs are known to be harder 
to learn through distributed algorithms [21] and therefore the competing estimators' performance also 
degrades. The two-hop relaxed MML estimator still outperforms by a large margin. 

a) Computational complexity.: In the relaxed MML approach, each local problem has the same 
structure as the centralized ML problem for which there are many efficient algorithms. Furthermore, 
the local problems can all be solved in parallel before the final one-step averaging. Therefore the total 
computational complexity of the proposed algorithm depends on the algorithm used for solving the 
sub-problems as well as the level of parallelization. 

In our experiments, we used the graphical LASSO with known structure | [T5| for solving both the 
centralized ML problem and the local problems in the distributed algorithm. For moderate p (number 
of variables) as in the examples presented here, the total run time of the distributed algorithm with no 
parallelization is comparable to the centralized one. As p grows however, we would expect the complexity 
of the distributed algorithm to scale at most linearly with p (assuming that the local neighborhood 
dimension scales more slowly, such as with K-NN graphs and lattice). The growth is even slower if 
the algorithm can be parallelized. For the centralized algorithm, the dependence of complexity on p is 
expected to be at least linear (the growth is much faster when generic solvers are used). A more severe 
problem is that, most computationally efficient algorithms work in the Lagrangian dual space in which 
the problem usually loses sparsity. Solving the dual problem requires centralized storage and iterative 
updating of a large number of variables, which in turn requires expensive communication among non- 
adjacent nodes in a network setting. A more detailed study of the computational complexity will be 
included in the full version of this paper. 



B. Real-world data 

Finally we apply the estimators to a real-world data set to evaluate their performance and robustness. 



The Lab dataset |22| contains temperature information from a sensor network of 54 nodes deployed in 
the Intel Berkeley Research lab between February 28 and April 5, 2004. This dataset is known to be very 
difficult with many missing data, noise and failed sensors. We select 50 sensors with relatively stable 
and regular measurements. To obtain a target concentration matrix, we use 1800 consecutive samples per 
sensor, interpolate the missing or failed readings and de-trend the data using a local rectangular window of 
10 samples. Next, we compute the sample covariance and invert it to obtain a sample concentration matrix. 
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This concentration matrix is then thresholded to yield a ground truth graphical model with a sparsity 
level of 70% zeros. Using knowledge of the sparsity and sampling from the original 1800 samples, we 
estimate the concentration matrix using the same estimators as before. As shown in Figure 2(d)[ the 



proposed two-hop relaxed MML estimator still gives a very tight approximation to the global MLE and 
its advantage over other distributed estimators is obvious. 

V. Conclusion 

We have proposed a distributed MML framework for estimating the concentration matrix of Gaussian 
graphical models. The proposed method independently solves convex relaxations of marginal likelihood 
maximization problems in local neighborhoods. A global estimate is then obtained by combining the 
local estimates via a single round of local averaging. The proposed estimator is shown to be asymptot- 
ically consistent and computationally efficient. Its improved performance relative to existing distributed 
estimators is illustrated through extensive experiments. 
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Appendix 



Proof: First we note that the relaxed MML estimation problem ( [T6| ) involves only the parameters 
Kfl and Sr because of the sparsity constraints. Since the marginal distribution corresponding to ( [76] ) 
belongs to a minimal exponential family, the mapping from moment (covariance) parameters to 
canonical (concentration) parameters K^*^'"" exists and is unique provided that S is positive definite |J2|. 



The mapping is specified implicitly by the optimality conditions for (iTml, 



Sr- ^K'^'^'^") 'j^ = 0. (20) 

The left-hand side is differentiable with respect to Sr and K^'^''"' and the Jacobian with respect to K^^'''" 
is invertible as long as S is positive definite. Hence by the implicit function theorem, the mapping from 
Si? to K^*^''^" is differentiable and thus continuous. ■ 
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